Abstract

Background subtraction (BS) is one of the key steps in video
analysis. Many background models have been proposed and have
achieved promising performance on public data sets. However, due
to challenges such as illumination change and dynamic background,
the resulting foreground segmentation often contains holes as well
as background noise. In this regard, we consider generalized fused
lasso regularization to seek intact, structured foregrounds.
Together with certain assumptions about the background, such as
the low-rank assumption or the sparse-composition assumption
(depending on whether pure background frames are provided), we
formulate BS as a matrix decomposition problem with regularization
terms for both the foreground and background matrices. Moreover,
under the proposed formulation, the two generally distinct
background assumptions can be solved in a unified manner. The
optimization is carried out by applying the augmented Lagrange
multiplier (ALM) method in such a way that a fast parametric-flow
algorithm is used for updating the foreground matrix. Experimental
results on several popular BS data sets demonstrate the advantage
of the proposed model over state-of-the-art methods.

1. Introduction 

Background subtraction (BS) is often regarded as a key step in
video analysis. In general, it is challenging to devise a good
background model; well-known challenges include illumination
changes, dynamic background, bootstrapping and camouflage. To meet
these challenges, many works on BS have been proposed. In the
following, we discuss some related topics.

Models of BS. From the representation perspective, most existing
works can be categorized into two classes: pixel-wise modeling and
frame-wise modeling. In the first category, representative methods
model pixel-wise statistics of the background using mixture of
Gaussians (MoG) models [?] and neural network models [?], etc.
Non-parametric models have also been proposed for improved
efficiency [?]. The pixel-wise models are prone to producing
fragmentary foregrounds, i.e. there are both “holes” in the
foregrounds and false positive pixels from the background. In
contrast, the models of the second category, i.e. frame-wise
models, usually achieve better performance by exploiting structure
information of the background. These works can generally be viewed
as follow-ups of the celebrated eigen-background model proposed in
[20], whose key assumption is that, when camera motion is small,
the matrix consisting of background frames is approximately
low-rank [20, 7]. Hence, these models project video frames onto
the subspace spanned by the eigenvectors associated with the
largest eigenvalues of the matrix composed of all the frames of a
video sequence. The recovered signal in the subspace is regarded
as background and the residual is assumed to be foreground.
Although the structure information acquired by such holistic
models helps to improve the integrity of the recovered background,
the improvement can be limited in many situations because the
foreground structural prior is neglected. Thus deliberate
post-processing steps are often needed, e.g. in [5], [?], [19].
Therefore, is there a way to seek intact, structured foregrounds,
which in turn can benefit background estimation?
Learning in BS. In some scenarios, when pure background frames
are available, background models can be learned in a supervised
manner. We call this situation the supervised model learning (SML)
case. Many pixel-wise models belong to this category; they
learn/update the models from given background pixels. Frame-wise
models, in contrast, tend to neglect this piece of information
because they originate from the blind decomposition of the
eigen-background model.

In many other situations, foreground and background coexist in
each frame. We call this situation the unsupervised model learning
(UML) case; it is more challenging than the SML case. In practice,
however, pixel-wise models are learned from the frames preceding
the test frame, regardless of whether those frames contain
foregrounds. Consequently, in the UML case, the pixel-wise models
are less robust than their frame-wise counterparts, since the
latter can exploit the holistic structure prior of background
frames without knowing explicit background labels.

Considering the two learning cases, is there a framework 
that unifies both SML and UML situations of BS? 

The Proposed Model. To address the above two questions, 
we devise a BS model that explicitly models the cohesion 
structure of the foregrounds in addition to the background 
structural prior, and propose a unified framework that solves 
both UML and SML cases of the BS problem. 

Notice that the foregrounds in a video sequence often correspond
to meaningful objects such as people, cars, etc. Therefore, the
foreground pixels are usually spatially connected, and also sparse
if the objects are relatively small w.r.t. the background scene.
We realize such generic foreground structural priors by adopting
an adaptive version of the generalized fused lasso (GFL)
regularization in [25]. GFL can be viewed as a combination of the
$\ell_1$ norm of both the variable values and their pairwise
differences, i.e. the total variation (TV) penalty [?]. By further
modeling the connection/fusion strength between pixel pairs
according to their similarity (which is a strong prior in semantic
segmentation [3]), our foreground model can be considered a
flexible structural prior model without any pre-defined
organization of the pixels. Specifically, we denote each frame as
a vector and the sequence of frames as a matrix concatenating all
the frame vectors. We assume that the observed matrix is the sum
of a background matrix and a foreground matrix. Thus, by imposing
a low-rank term on the background matrix and the GFL term on all
foreground vectors, we formulate BS as a matrix decomposition
problem. In this way, the proposed model exploits structure
information from both the background and the foreground.

To harness the availability of pure background frames in the SML
situation, we derive a special case of the proposed formulation.
This is done by explicitly adding constraints such that part of
the observed matrix equals the given background matrix. We further
assume that the unknown background vectors of the testing frames
lie in the span of the given background matrix, which is itself a
low-rank matrix. In this way, we show that the resulting
optimization is equivalent to a sparse estimation problem.

From the perspective of optimization, the derived objective and
constraints form a new problem. We propose an iterative algorithm
by applying the augmented Lagrange multiplier (ALM) method, which
alternately updates the background matrix and the foreground
matrix. When updating the background, singular value thresholding
(SVT) [6] is applied for UML and the fast iterative
shrinkage-thresholding algorithm (FISTA) [?] is applied for SML.
When updating the foreground, we solve the fused optimization with
a fast parametric-flow algorithm [?]. The idea behind this
alternation is that the simultaneous estimates of the foreground
and the background can reinforce each other. Indeed, experiments
show that the proposed model achieves better than state-of-the-art
performance on several popular data sets including both natural
and synthetic videos.

Related Works. In [7], the robust principal component analysis
(RPCA) model was applied to solve the BS problem. From the
standpoint of BS per se, RPCA can be viewed as an extension of the
eigen-background model in which an explicit sparsity assumption on
the foregrounds is taken into account, but not their
connectedness. Here we introduce a stronger foreground model. In
[?] and [26], the group lasso (with overlap) regularization was
applied to model the foregrounds, where the structure of the
foreground is assumed to be group sparse with predefined atomic
group structures. These works reported improved performance over
RPCA. However, our experiments show that our model outperforms
that of [26]^1 on all of the tested sequences. This indicates that
the adaptive GFL could be a more flexible foreground structural
prior than group lasso. In particular, Figure 6 shows such a
comparison.
Contributions. In summary, the contributions of this work are
threefold. (1) We introduce an adaptive generalized fused lasso as
a flexible structural prior for modeling foreground objects in the
background subtraction problem. We show that the performance of BS
can be much improved by exploiting the structure information of
both the foreground and the background. (2) We propose an
effective algorithm to optimize the new objective function, i.e.
constrained rank minimization with GFL, by extending the augmented
Lagrange multiplier method. (3) The proposed solution to BS is a
unified method that is able to solve both the supervised and the
unsupervised learning cases, depending on whether pure background
frames are available, even though they lead to different
objectives.

2. Proposed Background Subtraction Method 
2.1. Unsupervised Model Learning 

We start by introducing our model for the unsupervised model
learning (UML) problem. Given a sequence of $n$ video frames, each
frame is denoted as a vector $\mathbf{d}^{(i)} \in \mathbb{R}^{p}$,
$i = 1, \dots, n$, where $p$ is the number of pixels. All data are
concatenated into one matrix $D \in \mathbb{R}^{p \times n}$, which
is called the observation matrix. We assume that the observation
matrix is the sum of a background matrix and a foreground matrix,
i.e. $D = B + F$, where $B, F \in \mathbb{R}^{p \times n}$ are the
background matrix and the foreground matrix, respectively.
Therefore, by assuming that $B$ is low-rank and $F$ is structurally
sparse, we propose the following matrix decomposition objective,

$$\min_{B,F}\ \operatorname{rank}(B) + \lambda \|F\|_{gfl}, \quad \text{s.t. } D = B + F, \tag{1}$$

where $\lambda > 0$ is a tuning parameter (controlling the relative
contribution of the two terms) and $\|\cdot\|_{gfl}$ is the
generalized fused lasso regularization defined as

$$\|F\|_{gfl} = \sum_{k=1}^{n} \Big\{ \|\mathbf{f}^{(k)}\|_1 + \sum_{(i,j)\in\mathcal{N}} w_{ij}\, |f_i^{(k)} - f_j^{(k)}| \Big\}, \tag{2}$$

where $\mathbf{f}^{(k)}$ is the $k$th foreground vector and
$\mathcal{N}$ is the spatial neighborhood set, i.e.
$(i,j) \in \mathcal{N}$ when pixels $i$ and $j$ are spatially
connected. Due to the $\ell_1$ penalties on each pixel as well as
on each adjacent pair of pixels, the solutions $\mathbf{f}$ tend to
be both sparse and spatially connected. Here the $w_{ij}$ are
introduced to enhance the conventional GFL model [25] such that
$w_{ij}$ encodes the strength of the fusion between neighboring
pixels. In our model $w_{ij}$ is defined as

$$w_{ij} = \exp\Big( -\frac{\|d_i - d_j\|_2^2}{2\sigma^2} \Big), \tag{3}$$

where $d_i$ is the intensity of pixel $i$ in the observed frame.
This definition makes $w_{ij}$ an adaptive weight that encourages
spatial cohesion according to the intensities of the associated
pixels in the observed images. To be specific, when we observe a
large difference between two neighbouring pixels, there is a high
probability that this pair of pixels belongs to different
segments; therefore we decrease the fusion of this pair.
$\sigma > 0$ is a tuning parameter that is set empirically. When
$\sigma \to \infty$, all $w_{ij} = 1$ and the model reduces to the
conventional GFL [25], where the fused term encourages pure
spatial cohesion regardless of the pixel differences. When
$\sigma \to 0$, $w_{ij} \to 0$ for every pair of pixels with
differing intensities and the model reduces to the RPCA model [7],
where the foreground pixels are only assumed to be sparse.

^1 The model in [?] applied group sparsity to a trajectory
representation of videos, instead of the pixels considered here.
Therefore, in the experiments, we focus on comparing with [26],
which applied various forms of group sparsity to a pixel
representation and is more recent than [?].

For ease of optimization, the convex nuclear (trace) norm is often
applied to relax the matrix rank. Thus in practice, the following
surrogate is considered:

$$\min_{B,F}\ \|B\|_* + \lambda \|F\|_{gfl}, \quad \text{s.t. } D = B + F, \tag{4}$$

where $\|B\|_*$ is the nuclear norm of matrix $B$, i.e. the sum
of the singular values of $B$.
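To make the regularizer of Eqs. (2)-(3) concrete, the following is a minimal NumPy sketch that evaluates the adaptive weights and the GFL penalty for a single frame. It assumes a 4-connected pixel grid as the neighborhood set (the text only requires pixels to be "spatially connected") and grayscale intensities; the function names `adaptive_weights` and `gfl_penalty` are illustrative and not part of the original work.

```python
import numpy as np

def adaptive_weights(d, sigma):
    """Fusion weights of Eq. (3) for horizontal and vertical neighbour pairs.

    d is a 2-D array of pixel intensities of one observed frame.
    Assumption: the neighbourhood set is the 4-connected grid.
    """
    w_h = np.exp(-(d[:, 1:] - d[:, :-1]) ** 2 / (2.0 * sigma ** 2))
    w_v = np.exp(-(d[1:, :] - d[:-1, :]) ** 2 / (2.0 * sigma ** 2))
    return w_h, w_v

def gfl_penalty(f, w_h, w_v):
    """Per-frame term of Eq. (2): l1 sparsity plus weighted pairwise fusion."""
    sparsity = np.abs(f).sum()
    fusion = (w_h * np.abs(f[:, 1:] - f[:, :-1])).sum() \
        + (w_v * np.abs(f[1:, :] - f[:-1, :])).sum()
    return sparsity + fusion

# Toy 2x3 frame: the right-hand column is a bright foreground strip.
d = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
f = d.copy()
w_h, w_v = adaptive_weights(d, sigma=0.5)
penalty = gfl_penalty(f, w_h, w_v)
```

In this toy frame the only non-zero pairwise differences lie on the object boundary, where the observed intensities also differ, so the adaptive weights shrink the fusion cost there; with a very large `sigma` all weights approach 1 and the same boundary is penalized at full strength.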

2.2. Optimization via ALM 

Eq. (4) is a convex optimization problem, and off-the-shelf
solvers can be applied to solve it. However, when the dimension of
$D$ is large (which is often the case in BS), more efficient
algorithms have to be devised. Here we employ the augmented
Lagrange multiplier method [?], [16] to solve such an
equality-constrained optimization. We first formulate the
following augmented Lagrangian function

$$\mathcal{L}(B, F, Y, \mu) = \|B\|_* + \lambda \|F\|_{gfl} + \langle Y, D - B - F \rangle + \frac{\mu}{2} \|D - B - F\|_F^2, \tag{5}$$

where $\|\cdot\|_F$ is the Frobenius norm, $Y$ is the Lagrangian
multiplier and $\mu$ is an auxiliary positive scalar. According to
[?], the optimization problem in Eq. (4) can be solved by
iteratively searching for the optimal $B$, $F$ and $Y$ with
respect to Eq. (5). Under some rather general conditions, e.g.
when $\{\mu_k\}$ is an increasing and bounded sequence, the
searching process will converge Q-linearly to the optimal
solution. We summarize the pseudo code in Algorithm 1^2 and
discuss how to update $B$ and $F$ in each iteration.

Algorithm 1 ALM algorithm for Eq. (4).

  Input: $D \in \mathbb{R}^{p \times n}$, $\lambda > 0$.
  Output: $B, F \in \mathbb{R}^{p \times n}$.
  Initialization: Set $Y_0 = 0$, $B_0 = 0$, $F_0 = 0$, $\mu_0 > 0$,
      $\beta > 1$ and $\mu_{\max}$.
  while not converged do
      $B_{k+1} = \arg\min_{B} \mathcal{L}(B, F_k, Y_k, \mu_k)$
      $F_{k+1} = \arg\min_{F} \mathcal{L}(B_{k+1}, F, Y_k, \mu_k)$
      $Y_{k+1} = Y_k + \mu_k (D - B_{k+1} - F_{k+1})$
      $\mu_{k+1} = \min\{\beta \mu_k, \mu_{\max}\}$
  return $B_k$, $F_k$
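Updating B. We consider the following problem

$$\begin{aligned}
B_{k+1} &= \arg\min_{B} \mathcal{L}(B, F_k, Y_k, \mu_k) \\
&= \arg\min_{B} \|B\|_* + \langle Y_k, D - B - F_k \rangle + \frac{\mu_k}{2} \|D - B - F_k\|_F^2 \\
&= \arg\min_{B} \frac{1}{\mu_k} \|B\|_* + \frac{1}{2} \|B - M_1\|_F^2,
\end{aligned} \tag{6}$$

where $M_1 = D - F_k + \frac{1}{\mu_k} Y_k$. Eq. (6) is a
standard nuclear norm minimization problem, which is known to be
solvable quickly via singular value thresholding (SVT) [6].
According to [6], the solution to Eq. (6) is

$$B_{k+1} = U\, T_{1/\mu_k}(\Sigma)\, V^{\top}, \quad \text{where } (U, \Sigma, V^{\top}) = \operatorname{svd}(M_1). \tag{7}$$

$T_{\tau}(\cdot)$ is an element-wise soft-thresholding operator,
i.e. $\operatorname{diag}(T_{\tau}(\Sigma)) = [t_{\tau}(\sigma_1), t_{\tau}(\sigma_2), \dots, t_{\tau}(\sigma_r)]$,
where $t_{\tau}(\sigma)$ is defined as

$$t_{\tau}(\sigma) = \operatorname{sign}(\sigma) \max(|\sigma| - \tau, 0). \tag{8}$$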
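The background update of Eqs. (6)-(8) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `soft_threshold` and `svt` and the random test matrix are hypothetical, and a dense SVD is used, which would be too slow for full-resolution video.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise operator t_tau of Eq. (8)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding, Eq. (7): shrink the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scaling the columns of U by the shrunk singular values equals U @ diag(.) @ Vt.
    return (U * soft_threshold(s, tau)) @ Vt

rng = np.random.default_rng(0)
M1 = rng.standard_normal((6, 4))   # stands in for D - F_k + Y_k / mu_k
B_next = svt(M1, tau=0.5)          # tau plays the role of 1 / mu_k
```

The singular values of `B_next` are those of `M1` reduced by `tau` and clipped at zero, which is how the update lowers the rank of the background estimate.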

^2 Notice that Algorithm 1 is an approximate version of the
original ALM. This approximation generally gives satisfactory
results while converging much faster in practice [?].












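Updating F. Now we consider the updating of $F$:

$$\begin{aligned}
F_{k+1} &= \arg\min_{F} \mathcal{L}(B_{k+1}, F, Y_k, \mu_k) \\
&= \arg\min_{F} \lambda \|F\|_{gfl} + \langle Y_k, D - B_{k+1} - F \rangle + \frac{\mu_k}{2} \|D - B_{k+1} - F\|_F^2 \\
&= \arg\min_{F} \frac{\lambda}{\mu_k} \|F\|_{gfl} + \frac{1}{2} \|F - M_2\|_F^2 \\
&= \arg\min_{F} \sum_{l=1}^{n} \Big\{ \frac{\lambda}{\mu_k} \|\mathbf{f}^{(l)}\|_1 + \frac{\lambda}{\mu_k} \sum_{(i,j)\in\mathcal{N}} w_{ij}\, |f_i^{(l)} - f_j^{(l)}| + \frac{1}{2} \|\mathbf{f}^{(l)} - \mathbf{m}^{(l)}\|_2^2 \Big\},
\end{aligned} \tag{9}$$

where $M_2 = D - B_{k+1} + \frac{1}{\mu_k} Y_k$ and
$\mathbf{m}^{(l)}$ is the $l$-th column of $M_2$. Notice that in
Eq. (9) the optimizations of the columns are independent of each
other. Therefore, solving Eq. (9) amounts to solving the following
problem $n$ times:

$$\mathbf{f}^* = \arg\min_{\mathbf{f}}\ \lambda_1 \|\mathbf{f}\|_1 + \lambda_2 \sum_{(i,j)\in\mathcal{N}} w_{ij}\, |f_i - f_j| + \frac{1}{2} \|\mathbf{f} - \mathbf{m}\|_2^2, \tag{10}$$

where $\lambda_1 = \lambda/\mu_k$ and $\lambda_2 = \lambda/\mu_k$.
In order to solve Eq. (10), according to [?], we introduce the
following lemma.

Lemma 2.1. Suppose we have

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}}\ \lambda_2 \sum_{(i,j)\in\mathcal{N}} w_{ij}\, |f_i - f_j| + \frac{1}{2} \|\mathbf{f} - \mathbf{m}\|_2^2, \tag{11}$$

then the solution to Eq. (10), i.e. $\mathbf{f}^*$, can be obtained
by element-wise soft-thresholding such that
$f_i^* = t_{\lambda_1}(\hat{f}_i)$ for $i = 1, \dots, p$.

The proof follows from the optimality conditions of Eq. (10) and
Eq. (11). We provide a sketch of the proof as follows. A rigorous
proof can be found in [?].

Proof. We define the objectives in Eq. (10) and Eq. (11) as
$g_1(\mathbf{f})$ and $g_2(\mathbf{f})$ respectively. Since
$\hat{\mathbf{f}}$ is the optimizer of $g_2(\mathbf{f})$, it
satisfies $\partial g_2(\hat{\mathbf{f}}) / \partial \mathbf{f} = 0$
(the sub-gradient is applied where necessary). Because the
additional $\|\mathbf{f}\|_1$ term in Eq. (10) is separable with
respect to $f_i$, after applying the element-wise
soft-thresholding, i.e. $f_i^* = t_{\lambda_1}(\hat{f}_i)$, the
resulting $\mathbf{f}^*$ satisfies
$\partial g_1(\mathbf{f}^*) / \partial \mathbf{f} = 0$. □

Due to Lemma 2.1, we can first solve Eq. (11) and then use such
element-wise soft-thresholding to finally solve Eq. (10) and
therefore update $F$. Notice that Eq. (11) is a continuous total
variation formulation. In [25], [?], it is shown that Eq. (11) is
equivalent to a parametric graph-cut problem which can be
efficiently solved via fast flow algorithms such as the
parametric-flow algorithm proposed in [?].

2.3. Supervised Model Learning
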
In the situation where pure background frames are given (i.e.
SML), we can of course still apply the same method above for
background subtraction. However, by doing so, we do not fully
exploit the provided information about the background. To utilize
such extra information, we derive a variant of the model
introduced above.

We separate the observation matrix $D$ as $D = [D_1, D_2]$, where
$D_1$ is the matrix of all pure background frames (the training
data) and $D_2$ is the matrix containing the remaining frames with
mixed content. The unknown $B$ and $F$ are separated
correspondingly. We assume $D_1 = B_1$ and thus $F_1 = 0$.
Applying these to Eq. (1), we have

$$\min_{B,F}\ \operatorname{rank}([B_1, B_2]) + \lambda \|F_2\|_{gfl}, \quad \text{s.t. } D_1 = B_1,\ D_2 = B_2 + F_2. \tag{12}$$


We now make another assumption that
$\operatorname{rank}([B_1, B_2]) = \operatorname{rank}(B_1)$. The
idea behind this assumption is that if we have enough pure
background frames, the corresponding background vectors fully span
the background subspace. Under this assumption, the columns of the
unknown $B_2$ can be represented as linear combinations of the
columns of $B_1$. Specifically, we have $B_2 = B_1 S = D_1 S$,
where $S$ is the coefficient matrix. Thus, Eq. (12) becomes

$$\min_{S, F_2}\ \operatorname{rank}(D_1 [I, S]) + \lambda \|F_2\|_{gfl}, \quad \text{s.t. } D_2 = D_1 S + F_2. \tag{13}$$


Interestingly, since $D_1$ is observed and
$\operatorname{rank}(D_1 [I, S]) = \operatorname{rank}(D_1)$ for
any $S$, the rank term is irrelevant to the optimization. As
before, we assume $D_1$ to be low-rank; therefore there must exist
a sparse $S$. This is because each column of $B_2$ can be
represented as a linear combination of a small number of the
columns of $D_1$ (given that $D_1$ is low-rank). So we can instead
propose to solve

$$\min_{S, F_2}\ \|S\|_1 + \lambda \|F_2\|_{gfl}, \quad \text{s.t. } D_2 = D_1 S + F_2, \tag{14}$$

where $\|\cdot\|_1$ is a convex surrogate for $\|\cdot\|_0$, which
counts the number of non-zero entries.
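The step from Eq. (13) to Eq. (14) rests on the fact that appending columns of the form $D_1 S$ cannot raise the rank of $D_1$. A small numerical check of this identity is sketched below; the matrix sizes and the random data are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 pixels, 8 pure background frames, 5 test frames.
basis = rng.standard_normal((50, 2))
D1 = basis @ rng.standard_normal((2, 8))      # rank-2 background matrix
S = rng.standard_normal((8, 5))               # arbitrary coefficient matrix

stacked = D1 @ np.hstack([np.eye(8), S])      # equals [D1, D1 S]
rank_D1 = np.linalg.matrix_rank(D1)
rank_stacked = np.linalg.matrix_rank(stacked)
```

Both ranks come out equal for any choice of `S`, which is why the rank term can be dropped and replaced by a sparsity penalty on the coefficients.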

Figure 1. Alternated updating of the background and the
foreground. In each iteration (iter) either the background model
or the foreground is updated and the objective value (the green
plots) keeps decreasing until convergence.

Eq. (14) is our SML BS model. Since it is again an
equality-constrained optimization, the ALM introduced above can
still be applied. This time, when updating the foregrounds, the
optimization is the same as before, except that we are now dealing
with $F_2$ instead of $F$. When updating the background, we solve
the following problem

$$S_{k+1} = \arg\min_{S}\ \frac{1}{\mu_k} \|S\|_1 + \frac{1}{2} \|D_1 S - M_1\|_F^2, \tag{15}$$

where $M_1$ is defined as in Eq. (6) with $D_2$ and $F_2$ in place
of $D$ and $F$. Eq. (15) can be further decomposed into
column-wise optimizations, each of which is a standard Lasso [?]
problem. Many fast algorithms can be applied, e.g. the FISTA
algorithm proposed in [?].
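As an illustration of the column-wise subproblem of Eq. (15), the following is a minimal FISTA sketch for one Lasso problem in NumPy. It assumes a constant step size $1/L$, with $L$ the squared spectral norm of the design matrix, and a fixed iteration count instead of a convergence test; the function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_objective(A, b, alpha, s):
    return alpha * np.abs(s).sum() + 0.5 * np.sum((A @ s - b) ** 2)

def fista_lasso(A, b, alpha, n_iter=500):
    """Minimise alpha * ||s||_1 + 0.5 * ||A s - b||_2^2 with FISTA."""
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the smooth part
    s = np.zeros(A.shape[1])
    z, t = s.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - b)            # gradient of the quadratic term at z
        s_new = soft_threshold(z - grad / L, alpha / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = s_new + ((t - 1.0) / t_new) * (s_new - s)   # momentum step
        s, t = s_new, t_new
    return s

rng = np.random.default_rng(0)
D1 = rng.standard_normal((30, 6))           # stands in for the pure background frames
m = 2.0 * D1[:, 1]                          # one column of M_1, built from a single frame
s = fista_lasso(D1, m, alpha=0.1)           # alpha plays the role of 1 / mu_k
```

Because the columns of $S$ do not interact in Eq. (15), the same routine can be applied to each column of $M_1$ independently.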

In summary, we can effectively solve both the UML and SML BS
models by applying the ALM algorithm described in Algorithm 1.
Detailed updating rules for both the background $B$ and the
foreground $F$ are given above. Interestingly, although ALM is a
general optimization method, its application to BS helps us to
understand how our model alternately pursues and refines the
background and the foregrounds. In Figure 1 we visualize the
estimation in each iteration of ALM. We observe that the
foreground estimation improves as the iterations proceed. This is
mainly due to the synergy between the background estimation and
the foreground estimation.
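The alternation of Algorithm 1 can be sketched as follows for the special case $w_{ij} = 0$, in which, as noted in Section 2.1, the model reduces to RPCA and the foreground update becomes plain element-wise soft-thresholding. This is not the full method: the parametric-flow solver needed for the fused term is deliberately left out, and the initial $\mu$, the growth factor, the stopping rule and the synthetic data are assumptions made only for this illustration.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def alm_decompose(D, lam, mu=None, beta=1.5, mu_max=1e7, tol=1e-7, max_iter=500):
    """Algorithm 1 with w_ij = 0, i.e. the fused term of the GFL is switched off."""
    if mu is None:
        mu = 1.25 / np.linalg.norm(D, 2)    # heuristic starting value (assumption)
    B, F, Y = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    for _ in range(max_iter):
        B = svt(D - F + Y / mu, 1.0 / mu)              # background update, Eqs. (6)-(7)
        F = soft_threshold(D - B + Y / mu, lam / mu)   # foreground update, sparse term only
        R = D - B - F                                  # constraint residual
        Y = Y + mu * R                                 # multiplier update
        mu = min(beta * mu, mu_max)
        if np.linalg.norm(R) <= tol * np.linalg.norm(D):
            break
    return B, F

# Synthetic data: a rank-2 "background" plus 5% sparse "foreground" entries.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 40))
sparse = np.zeros((60, 40))
sparse[rng.random((60, 40)) < 0.05] = 5.0
D = low_rank + sparse
B, F = alm_decompose(D, lam=1.0 / np.sqrt(60))
```

Replacing the `soft_threshold` line with a solver for Eq. (10) would give the UML model, and replacing the `svt` line with a Lasso solver for Eq. (15) would give the SML model; the outer loop stays the same.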

3. Experiments 
3.1. Data sets 

We test our model on three popular BS data sets, namely, the
Wallflower^3 data set [?], the Li^4 data set [?] and the SABS^5
data set [?]. Altogether, there are 17 sequences of both natural
and synthetic videos. Most well-known BS challenges are present in
these sequences, e.g. gradual/sudden illumination changes, moving
background, bootstrapping, camouflage and occlusion. We give a
brief introduction to each of these data sets as follows.

• “Wallflower”: The Wallflower data set consists of 7 natural
video sequences representing different BS challenges. The
resolution of the frames is about 160 × 120. Manually labeled
ground truth is provided. Most of the sequences have pure
background frames, which can be used for SML.

• “Li”: The Li data set consists of 9 natural video sequences.
The resolution of the frames is about 176 × 144. Manually labeled
ground truth is provided. Some of the sequences have pure
background frames.

• “SABS”: The SABS data set is a synthetic data set and therefore
provides high quality ground truth. The resolution of the frames
is 800 × 600. Several BS challenges are synthesized in the same
scene. Following [12], only the basic sequence is used. Pure
background frames are available.

^3 http://research.microsoft.com/en-us/um/people/jckrumm/Wallflower/testimages.htm

^4 http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html

^5 http://www.vis.uni-stuttgart.de/forschung/informationsvisualisierung-und-visual-analytics/visuelle-analyse-videostroeme/sabs.html

All three data sets are popular public data sets. Results 
of many existing models have been reported based on these 
sets. In order to evaluate the proposed model, we directly 
compare with the results reported in the respective papers. 

3.2. Comparison with RPCA 

Recall that the proposed model in the UML case reduces to RPCA
when the model parameter $w_{ij} = 0$ in Eq. (2) (Section 2.1).
Therefore, we first show how the proposed model improves
performance over the RPCA model due to GFL foreground modeling.

Quantitative comparisons on all the sequences of the three data
sets are shown in Tables 2, 3 and 4. From the comparisons we can
see that the proposed model consistently outperforms RPCA.
Qualitatively, we use the same 200 frames of the airport sequence
in the “Li” data set as reported in [7] to construct a
head-to-head comparison, where we apply both RPCA and our model
for background subtraction. In Figure 2 we illustrate the BS
results of the test frame used in [7]. In practice, even after a
fine grid search for the best parameters, the detected foregrounds
of RPCA have more “holes” and more false positives from the
background than those of the proposed model. (Some obvious
examples are marked by the red boxes in Figure 2(d). Note that
these results are not post-processed.) Qualitative results of the
whole sequence are provided on our webpage^6.

Table 1. Brief summaries of the models compared. (Part of the descriptions are from [?].)

Method  Notation  Description
[?]     KDE       Kernel density estimation (KDE) with a spherical kernel. Uses a stochastic history.
[?]     G-KDE     Neural network variant of Gaussian KDE.
[?]     C-KDE     Codebook based; almost KDE with a cuboid kernel.
[?]     Hist      Histogram based, includes co-occurrence statistics. Lots of heuristics.
[?]     Map       Uses a self-organising map, passes information between pixels.
[21]    MoG       Classic MoG approach. Assigns mixture components to bg/fg.
[?]     R-MoG     Refinement of [21]. Has an adaptive learning rate.
[20]    Eigen     Eigenbackground.
[?]     Gauss     Unimodal (Gaussian).
[?]     D-MoG     Dirichlet process Gaussian mixture model.
[7]     RPCA      Robust PCA model.
[26]    G-Lasso   Online subspace learning with group lasso with overlap regularization.

Table 2. Results for the SABS data set, given as the F-score.

         Gauss   C-KDE   Eigen   Map     KDE     R-MoG   MoG     Hist    RPCA    G-Lasso  Ours
F-score  .3806   .5601   .5891   .6672   .7177   .7232   .7284   .7457   .6483   .7326    .7775*



Figure 2. Comparison of the results of RPCA and our model for the
airport data from the Li data set. (a) Test image. (b) Our
recovered background. (c) Our detected foregrounds. (d) RPCA
detected foregrounds.


3.3. Comparison with State-of-the-Art 

A brief summary of all the models we compared can be found in
Table 1. We compare our model to these models on all three data
sets^7. Following the literature, for the “Wallflower” data set,
the number of mis-classified pixels is used as the evaluation
criterion; for both “Li” and “SABS”, the F-score (F) is used as
the evaluation criterion. We put a “*” at the upper-right corner
of a score to indicate that the sequence is of the SML case.

^6 http://idm.pku.edu.cn/staff/wangyizhou/
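For reference, the two evaluation criteria can be computed from binary foreground masks as sketched below. The text does not spell out the F-score formula, so the standard definition (harmonic mean of precision and recall over foreground pixels) is assumed here; the function names and the toy masks are illustrative.

```python
import numpy as np

def misclassified_pixels(pred, truth):
    """Number of pixels whose predicted label differs from the ground truth."""
    return int(np.count_nonzero(pred != truth))

def f_score(pred, truth):
    """Harmonic mean of precision and recall for the foreground class."""
    tp = np.count_nonzero(pred & truth)
    fp = np.count_nonzero(pred & ~truth)
    fn = np.count_nonzero(~pred & truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

truth = np.array([[0, 1, 1],
                  [0, 1, 0]], dtype=bool)
pred = np.array([[0, 1, 0],
                 [1, 1, 0]], dtype=bool)
errors = misclassified_pixels(pred, truth)
score = f_score(pred, truth)
```

The pixel count depends on the frame size, whereas the F-score is normalized, which is one reason the two criteria are not directly comparable across data sets.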

On Wallflower data set. We tested our model on all seven
sequences of this data set. In Table 3 we provide quantitative
comparisons, where our model achieved the smallest number of
mis-classified pixels on five sequences and the second smallest on
one sequence. Note, however, that our model performed poorly on
the sequence “CF”. The reason is that the foreground in “CF”
occupies a large portion of the tested frame, which violates the
prior assumption of foreground sparsity. The same failure happened
to both the RPCA and G-Lasso models, since both of them also
assume a sparse foreground prior. In Figure 3 we show the
qualitative results of our model on the seven sequences.

On Li data set. We applied our model to all nine sequences of
the data set. In Table 4 we show quantitative comparisons, where
our model achieved the highest F-score on all these sequences.
Notably, in some sequences such as “lb” and “ap”, the
improvements over the second best are more than 10%. On average,
our model achieved an 8% F-score gain over the second best model.
The qualitative results of all nine sequences are shown in
Figure 4.

On SABS data set. Following [12], we apply our model to the
“Basic” sequence and compare with the other models on this
representative sequence. The results of different models on an
example frame (No. 448) are illustrated in Figure 5. (The
qualitative results on the whole sequence can be found on our
webpage.) As is shown, our model cuts out an almost perfect
foreground (including its shadow). In the ground truth, the shadow
is not included, which makes the value of the F-score relatively
low. However, this definition of foreground may be controversial
depending on the actual situation. Nevertheless, our model
outperforms all the other models on the test image. The average
F-scores of all the models on the whole sequence are summarized in
Table 2, where our model is shown to have achieved the highest
performance.

^7 Note that, since we are using the results reported by the
respective papers, not all the models have results on every
sequence.

Figure 3. Results on the Wallflower data set. From top to bottom:
test images, the ground truth and the estimations of our model.
Columns: (a) MO, (b) TD, (c) LS, (d) WT, (e) CF, (f) BS, (g) FA.

Table 3. Results for Wallflower, given as the number of pixels that have been mis-classified.

Method             MO     TD     LS      WT     CF      BS     FA
Frame Difference   0      1358   2565    6789   10070   2175   4354
Mean+threshold     0      2593   16232   3285   1832    3236   2818
Block correlation  1200   1165   3802    3771   6670    2673   2402
MoG                0      1028   15802   1664   3496    2091   2972
Eigen              1065   895    1324    3084   1898    6433   2978
D-MoG              0      330    3945    184    384     1236   1569
RPCA               0      628    2016    1014           1465   2875
G-Lasso            0      912    1067    629            1779   1139
Ours               0*     418*   686*    166*           795*   192*

Compare with group lasso. As mentioned in the related works of
Section 1, the group lasso regularization was applied to modeling
the foregrounds of BS in [26]. The authors used both “3×3 blocks
group” and “coarse-to-fine superpixel group” structures to pursue
connected sparse foregrounds. However, as can be seen from the
above comparisons, e.g. Tables 2, 3 and 4 and Figure 5, their
performance is not as good as that of the proposed model. In
Figure 6 we provide a close-up comparison with the deliberately
pre-defined grouping of pixels for foreground modeling. It shows
that the group lasso model generates artifacts in the detected
foreground objects due to an inappropriate pre-defined group
structure. This arguably indicates that, compared to (adaptive)
GFL, the group sparse models may not be flexible enough for
recovering arbitrary foreground shapes.


Figure 5. Results on the SABS data set. F-scores are shown.
(a) Test image. (b) Ground truth. (c) Ours: 0.876. (d) G-Lasso:
0.848. (e) RPCA: 0.802. (f) KDE: 0.803. (g) Hist: 0.782. (h) MoG:
0.819. (i) R-MoG: 0.807.





















Figure 4. Results on the Li data set. From top to bottom: test
images, the ground truth and the estimations of our model.
Columns: (a) cam, (b) ft, (c) ws, (d) mr, (e) lb, (f) sc, (g) ap,
(h) br, (i) ss.


Table 4. Results for Li, given as F-score.

Method   cam     ft      ws      mr      lb      sc      ap      br      ss      mean
Hist     .1596   .0999   .0667   .1841   .1554   .5209   .1135   .3079   .1294   .1930
MoG      .0757   .6854   .7948   .7580   .6519   .5363   .3335   .3838   .1388   .4842
Map      .6960   .6554   .8247   .8178   .6489   .6677   .5943   .6019   .5770   .6760
D-MoG    .7624   .7265   .9134   .3871   .6665   .6721   .5663   .6273   .5269   .6498
RPCA     .5226   .8650   .6082   .9014   .7245   .7785   .5879   .8322   .7374   .7286
G-Lasso  .8347   .8789   .9236   .8995   .6996   .8019   .5616   .7475   .6432   .7767
Ours     .8386   .9011   .9424*  .9592   .8208   .8500   .7422   .8476   .7613   .8515



Figure 6. Comparison of different foreground regularizations.
(a) G-Lasso (3×3 block). (b) G-Lasso (coarse-to-fine superpixel).
(c) Ours (adaptive fused lasso).


3.4. Discussions 

Computation. The algorithm does not take many iterations to
converge (see e.g. Figure 1); in practice the average number of
iterations is about 10-20. Therefore, the major computational
cost of pursuing structured background and foregrounds in the
intermediate steps is mitigated by the small number of iterations.
Moreover, since the foreground updates are column-wise, the
implementation can be highly parallelized in practice. The code
can be downloaded from our webpage.
SML vs. UML. Note that, in general, when pure background frames
are available, as for most of the sequences in the Wallflower
data set, we have reported the results of the SML model. Such a
choice outperforms its unsupervised counterpart, e.g. with an
improvement of 24 (for WT) to 179 (for TD) pixels on the
Wallflower data set. However, this is not always the case. For
example, in the “cam” sequence of the Li data set, although there
are pure background frames, they seem to be less representative,
possibly due to some background changes. In this case, the
supervised model did not achieve clearly better results but
remained competitive: 0.8382 (SML) vs. 0.8386 (UML).

Comparison with Explicit Post-processing. Arguably, explicit
post-processing in BS, e.g. in [?], can be viewed as a special
case of foreground modeling, since these methods fundamentally use
foreground structural priors to guide the post-processing.
Therefore, we carried out some experiments with the data used in
[?], where MoG models are post-processed by a hole-filling method.
In summary, our model achieved competitive or even better results;
detailed comparisons can be found on our webpage.

4. Conclusion 

In this paper, we propose a method of background subtraction that
exploits structure information of the foregrounds to help
background modeling. Our model works for both supervised and
unsupervised learning paradigms and automatically pursues
meaningful background and foregrounds. To optimize the new
objective function, we proposed an effective algorithm by
extending the ALM, which alternately updates the background and
the foreground matrices. Experimental results show that the
proposed model achieves better than state-of-the-art performance
on several popular public data sets.